Marketing Analytics Process

Continuous Data

Remember that summarizing data is initially all about discovery, the heart of exploratory data analysis.

  • Computing statistics (i.e., numerical summaries).
  • Visualizing data (i.e., graphical summaries).

How we summarize depends on whether the data are discrete or continuous.

  • Continuous means “forming an unbroken whole; without interruption.”
  • Continuous data are also called quantitative or numeric.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Which variables are continuous? What are their data types?

customer_data <- read_csv("customer_data.csv")
## Rows: 10531 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, married, college_degree, region, state, review_time, review...
## dbl (6): customer_id, birth_year, income, credit, review_id, star_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
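
One way to answer these questions is to list each column with its type and then pull out the numeric ones. This is a minimal sketch on a small made-up tibble: `toy` and its values are invented, and only the column names mirror customer_data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for customer_data; the values are invented.
toy <- tibble(
  customer_id = c(1, 2, 3),
  gender      = c("Female", "Male", "Other"),
  income      = c(52000, 87000, 140000),
  credit      = c(640.5, 702.1, 688.0)
)

# Print every column along with its data type.
glimpse(toy)

# Keep only the numeric columns, the candidates for continuous variables.
numeric_cols <- toy |>
  select(where(is.numeric)) |>
  names()

numeric_cols
```

A numeric type is only a hint. A column like customer_id is stored as a double but is an identifier, so a mean of it would not be meaningful.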

Summarize Continuous Data

One common statistic for a continuous variable is a mean.

customer_data |>
  summarize(avg_income = mean(income))
## # A tibble: 1 × 1
##   avg_income
##        <dbl>
## 1    138623.

Note that summarize() is more general than count() and can accommodate all sorts of calculations, much like mutate(). What is the main difference between summarize() and mutate()?
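
To explore that question, here is a small sketch that runs the same calculation through both functions and compares the number of rows that come back. The `toy` tibble and its values are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for customer_data; the values are invented.
toy <- tibble(
  customer_id = 1:4,
  income      = c(50000, 70000, 90000, 110000)
)

# summarize() collapses the data to one row per group (here, a single row).
collapsed <- toy |>
  summarize(avg_income = mean(income))

# mutate() keeps every row and adds the result as a new column.
expanded <- toy |>
  mutate(avg_income = mean(income))

nrow(collapsed)
nrow(expanded)
```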

  • Compute the mean of both income and credit.
  • We can also compute the mode, median, variance, standard deviation, minimum, maximum, sum, etc.

customer_data |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
## # A tibble: 1 × 2
##   avg_income avg_credit
##        <dbl>      <dbl>
## 1    138623.       667.
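
The other statistics work the same way inside summarize(). The sketch below uses an invented `toy` tibble. One caveat: base R's mode() returns an object's storage type, not the most frequent value, so the sketch uses count() as one possible way to find the statistical mode.

```r
library(dplyr)
library(tibble)

# Hypothetical income values, invented for illustration.
toy <- tibble(income = c(40000, 60000, 60000, 80000, 260000))

# Several statistics can be computed in a single summarize() call.
income_stats <- toy |>
  summarize(
    median_income = median(income),
    var_income    = var(income),
    sd_income     = sd(income),
    min_income    = min(income),
    max_income    = max(income),
    total_income  = sum(income)
  )

income_stats

# One way to find the most frequent value: count each value and keep the top row.
most_common <- toy |>
  count(income, sort = TRUE) |>
  slice_head(n = 1)

most_common
```

With a large value like 260000 in the data, the median sits well below the mean, which is a reason to look at more than one statistic.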

Visualize Data

{ggplot2} provides a consistent grammar of graphics built with layers.

  1. Data – Data to visualize.
  2. Aesthetics – Mapping graphical elements to data.
  3. Geometry – Or “geom,” the graphic representing the data.
  4. Facets, Labels, Scales, etc.

Visualize Continuous Data

Let’s plot the distribution of income.

customer_data |> 
  ggplot(aes(x = income)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
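
The message above is a prompt to choose the bins deliberately. Here is a minimal sketch that sets binwidth on simulated data. The `toy` incomes are randomly generated, and the width of 20000 is an arbitrary choice for illustration, not a recommendation for customer_data.

```r
library(ggplot2)
library(tibble)

# Simulated incomes, invented for illustration.
set.seed(42)
toy <- tibble(income = round(runif(200, min = 20000, max = 300000)))

# Each bar now covers an income range of 20,000.
p <- toy |>
  ggplot(aes(x = income)) +
  geom_histogram(binwidth = 20000)

p
```

It is worth trying a few widths, since the apparent shape of a distribution can change with the bins.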

Visualize the relationship between income and credit.

customer_data |> 
  ggplot(aes(x = income, y = credit)) +
  geom_point()

Visualize the relationship between star_rating and income.

customer_data |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_point()

## Warning: Removed 7372 rows containing missing values or values outside the scale range
## (`geom_point()`).
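
The warning means that rows with a missing star_rating were dropped from the plot. A quick way to check this before plotting is to count the missing values directly, sketched here on an invented `toy` tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical ratings with missing values, invented for illustration.
toy <- tibble(
  star_rating = c(5, NA, 4, NA, NA, 3),
  income      = c(50000, 60000, 70000, 80000, 90000, 100000)
)

missing_summary <- toy |>
  summarize(
    n_missing  = sum(is.na(star_rating)),     # rows a plot would drop
    n_present  = sum(!is.na(star_rating)),    # rows a plot would keep
    avg_rating = mean(star_rating, na.rm = TRUE)
  )

missing_summary
```

Without na.rm = TRUE, mean() returns NA whenever any value is missing.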

What do we do if there is overplotting? There’s a geom for that (geom_jitter()).

  • Drop the missing data before plotting.
  • Play with the size and alpha geom arguments.
  • Add a geom_smooth() layer.
  • How could we look at this same plot by region?
  • Complete the plot with labels.

customer_data |> 
  drop_na(star_rating) |> 
  ggplot(aes(x = star_rating, y = income)) +
  geom_jitter(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ region) +
  labs(
    title = "Relationship Between Star Rating and Income by Region",
    x = "Star Rating",
    y = "Income"
  )

## `geom_smooth()` using formula = 'y ~ x'

Summarize Continuous and Discrete Data

Grouped summaries provide a powerful way to compute statistics for continuous variables within each category of a discrete variable.

customer_data |>
  group_by(gender) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
## # A tibble: 3 × 4
##   gender     n avg_income avg_credit
##   <chr>  <int>      <dbl>      <dbl>
## 1 Female  5219    130685.       668.
## 2 Male    4214    146861.       666.
## 3 Other   1098    144735.       665.

Note how group_by() is a lot like facet_wrap(): both split the data by each category of the discrete grouping variable.

count() is a wrapper around a grouped summary using n().

customer_data |>
  group_by(gender) |>
  summarize(
    n = n()
  )
## # A tibble: 3 × 2
##   gender     n
##   <chr>  <int>
## 1 Female  5219
## 2 Male    4214
## 3 Other   1098

customer_data |>
  count(gender)
## # A tibble: 3 × 2
##   gender     n
##   <chr>  <int>
## 1 Female  5219
## 2 Male    4214
## 3 Other   1098

We can group by more than one discrete variable.

customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 5
## # Groups:   gender [3]
##    gender region        n avg_income avg_credit
##    <chr>  <chr>     <int>      <dbl>      <dbl>
##  1 Other  Midwest     124    154637.       663.
##  2 Male   Midwest     420    152467.       666.
##  3 Other  Northeast   337    150564.       665.
##  4 Male   Northeast  1285    150498.       665.
##  5 Male   West       2079    149453.       667.
##  6 Other  West        519    144420.       667.
##  7 Female Midwest     557    134083.       671.
##  8 Female West       2497    133819.       668.
##  9 Female Northeast  1602    133333.       669.
## 10 Other  South       118    119068.       660.
## 11 Male   South       430    117988.       669.
## 12 Female South       563    105888.       664.
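
The message about grouped output appears because summarize() removes only the last grouping variable, so the result is still grouped by gender. Setting .groups = "drop" returns an ungrouped tibble and silences the message. A minimal sketch on an invented `toy` tibble:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for customer_data; the values are invented.
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  region = c("West", "South", "West", "West", "South"),
  income = c(100, 200, 300, 500, 450)
)

by_group <- toy |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    .groups = "drop"   # return an ungrouped tibble
  ) |>
  arrange(desc(avg_income))

by_group
is_grouped_df(by_group)
```

Dropping the groups matters most when more steps follow, because later verbs such as mutate() or slice_max() would otherwise keep operating within each gender.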

We can also use slice_*() functions along with group_by().

customer_data |>
  group_by(gender, region) |>
  slice_max(income, n = 3)
## # A tibble: 36 × 14
## # Groups:   gender, region [12]
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        6119       1984 Female 315000   698. No      Yes            Midwest  
##  2        1299       1992 Female 306000   610. No      Yes            Midwest  
##  3        7139       1957 Female 302000   727. No      Yes            Midwest  
##  4        1040       1993 Female 356000   672. Yes     Yes            Northeast
##  5        6503       1997 Female 348000   599. No      Yes            Northeast
##  6       11249       1992 Female 343000   620. No      Yes            Northeast
##  7        2075       1989 Female 374000   578. No      Yes            South    
##  8        7128       1966 Female 301000   708. Yes     Yes            South    
##  9        4756       1977 Female 293000   790. No      Yes            South    
## 10        7366       1993 Female 376000   682. No      Yes            West     
## # ℹ 26 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>
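
Other members of the slice_*() family follow the same pattern. The sketch below uses slice_min() and slice_max() on an invented `toy` tibble, then calls ungroup() so the result is no longer grouped.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for customer_data; the values are invented.
toy <- tibble(
  region = c("West", "West", "West", "South", "South", "South"),
  income = c(10, 30, 20, 60, 40, 50)
)

# The single lowest income within each region.
lowest <- toy |>
  group_by(region) |>
  slice_min(income, n = 1) |>
  ungroup()

# The two highest incomes within each region.
top_two <- toy |>
  group_by(region) |>
  slice_max(income, n = 2) |>
  ungroup()

lowest
top_two
```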

Time Series Data

We often want to see how a variable changes over time: a time series. However, dates and times can be tricky.

customer_data |> 
  ggplot(aes(x = review_time, y = star_rating)) +
  geom_line()

## Warning: Removed 7372 rows containing missing values or values outside the scale range
## (`geom_line()`).

There’s a package for that!

rating_data <- customer_data |> 
  drop_na(star_rating) |> 
  select(review_time, star_rating) |> 
  mutate(review_time = mdy(review_time))

rating_data
## # A tibble: 3,159 × 2
##    review_time star_rating
##    <date>            <dbl>
##  1 2015-06-11            4
##  2 2008-03-25            5
##  3 2013-06-07            2
##  4 2016-04-20            5
##  5 2015-10-18            5
##  6 2015-01-06            5
##  7 2017-04-22            5
##  8 2014-09-11            4
##  9 2017-09-19            4
## 10 2013-12-12            5
## # ℹ 3,149 more rows
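
Here is a small sketch of why the parsing step matters. As text, dates sort alphabetically rather than chronologically. The strings below are invented, and their exact month/day/year format is an assumption about how review_time is stored.

```r
library(lubridate)

# Hypothetical raw date strings in month/day/year order.
raw_dates <- c("12/12/2013", "01/06/2015", "06/11/2015")

# As character data, sorting is alphabetical, so January 2015 comes first.
sort(raw_dates)

# mdy() parses the strings into Date objects that sort chronologically.
parsed <- mdy(raw_dates)
sort(parsed)

# Date objects also give access to components such as the year.
year(parsed)
```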

rating_data |> 
  ggplot(aes(x = review_time, y = star_rating)) +
  geom_line()

Visualize Grouped Summaries

Let’s summarize the data by a period of time and then plot the time series.

rating_data |> 
  mutate(review_year = year(review_time)) |> 
  group_by(review_year) |> 
  summarize(avg_star_rating = mean(star_rating)) |> 
  ggplot(aes(x = review_year, y = avg_star_rating)) +
  geom_line()
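
The period does not have to be a year. As a sketch of one alternative, floor_date() from {lubridate} rounds each date down to the start of its month, which can then serve as the grouping variable. The `toy` tibble is invented, and n_reviews is an extra column added here to show how many reviews sit behind each average.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical reviews, invented for illustration.
toy <- tibble(
  review_time = as.Date(c("2015-01-06", "2015-01-20", "2015-06-11", "2016-04-20")),
  star_rating = c(5, 3, 4, 5)
)

monthly <- toy |>
  mutate(review_month = floor_date(review_time, unit = "month")) |>
  group_by(review_month) |>
  summarize(
    n_reviews = n(),
    avg_star_rating = mean(star_rating)
  )

monthly
```

Shorter periods mean fewer reviews per group, so the averages tend to be noisier.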

Visualize Continuous and Discrete Data

Just like there are geoms for visualizing continuous or discrete data, there are geoms for visualizing the relationship between continuous and discrete data.

customer_data |> 
  ggplot(aes(x = income, y = gender)) +
  geom_boxplot()

customer_data |> 
  ggplot(aes(x = income, fill = gender)) +
  geom_density(alpha = 0.5)

Embracing the Grammar of Graphics

Visualize the relationship between income and credit.

  • Income is hard to read; let’s rescale it to thousands.
  • Map gender to the color aesthetic.
  • Modify the size and alpha arguments.
  • Add geom_smooth().
  • Can we get facets for each combination of region and gender?
  • What happens when the color aesthetic is set in geom_point()?
  • Add labels and modify the scale colors.
  • Remove the gray default background by using a different theme. Try theme_minimal().
  • The legend for the scale is redundant; let’s remove it using the legend.position argument in theme().

customer_data |> 
  mutate(income = income / 1000) |> 
  ggplot(aes(x = income, y = credit)) +
  geom_point(aes(color = gender), size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(gender ~ region) +
  labs(
    title = "Income and Credit by Region and Gender",
    x = "Income (in Thousands)",
    y = "Credit"
  ) +
  scale_color_manual(
    name = "Gender",
    values = c("violet", "purple", "turquoise")
  ) +
  theme_minimal() +
  theme(legend.position = "none")

## `geom_smooth()` using formula = 'y ~ x'
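
As for where the color aesthetic is set, the sketch below compares mapping color inside geom_point() with mapping it in ggplot(). It uses an invented `toy` tibble without facets, and it counts the fitted lines with layer_data(), a {ggplot2} helper that returns the data computed for a layer.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for customer_data; the values are invented.
toy <- tibble(
  income = c(40, 60, 80, 50, 70, 90, 45, 65, 85),
  credit = c(600, 650, 700, 620, 640, 690, 610, 660, 680),
  gender = rep(c("Female", "Male", "Other"), each = 3)
)

# Color mapped in geom_point() only: the points are colored,
# but geom_smooth() does not see the mapping and fits one line.
p_local <- ggplot(toy, aes(x = income, y = credit)) +
  geom_point(aes(color = gender)) +
  geom_smooth(method = "lm", se = FALSE)

# Color mapped in ggplot(): every layer inherits it,
# so geom_smooth() fits one line per gender.
p_global <- ggplot(toy, aes(x = income, y = credit, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Count the distinct groups in the smooth layer (layer 2) of each plot.
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
n_lines_global <- length(unique(layer_data(p_global, 2)$group))

n_lines_local
n_lines_global
```

A third option is to set a fixed color outside aes(), for example geom_point(color = "purple"), which colors every point the same and creates no legend.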

Wrapping Up

Summary

  • Computed grouped summaries.
  • Practiced plotting with {ggplot2}, including some advanced options.

Next Time

  • Creating reports (and almost anything else) using Quarto.
  • Tidying data and the philosophy behind “tidy” data.

Supplementary Material

  • R for Data Science (2e) Chapters 2 and 19

Artwork by @allison_horst

Exercise 4

In RStudio on Posit Cloud, create a new R script and do the following.

  1. Load the tidyverse.
  2. Import and join customer_data and store_transactions.
  3. Explore this combined dataset using the functions we’ve covered.
  4. Provide at least one interesting numeric summary and one interesting visualization that include continuous variables.
  5. Practice good coding conventions as discussed.
  6. Export the R script and upload to Canvas.